Bioinformatics (Thomas Dandekar, Meik Kunz)

protein families and associated proteins, which can be found in ever new combinations of

domains. An overview of protein domain families can be found in the SMART database

(https://smart.embl-heidelberg.de). Alternatively, one can use the Conserved Domain

Database (CDD), which enables independent verification (Lu et al. 2020).

To look at the underlying genes, the “clusters of orthologous groups” (COGs) first gave

an overview, starting from bacteria (https://www.ncbi.nlm.nih.gov/COG/). Eukaryotic gene

groups were considered next (KOGs, eukaryotic orthologous groups;

https://mycocosm.jgi.doe.gov/help/kogbrowser.jsf). In this context, a cluster of genes

means that the same gene is found in a great many organisms, so that the same protein is

required and encoded in very different organisms. Such a gene is regarded as an “ortholog”

because the domain composition is the same. Eventually, these orthologous groups were

systematically extended and are now called eggNOGs (Huerta-Cepas et al. 2017). Excitingly, it can

also be shown convincingly that the original richness of forms was much smaller, because the

primitive cell underlying all present-day life (the “LUCA”, last universal common

ancestor; Weiss et al. 2018), already using the same genetic code, had only about

1000–1500 proteins, which are still found today as highly conserved protein families in

virtually all organisms (similar to, but not completely congruent with, the COGs). The protein

language is universal and only grew from a relatively manageable inventory to its current

richness over billions of years of evolution.
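The idea of a conserved core of protein families can be illustrated with a small sketch. It assumes each organism is described simply by the set of protein families it encodes and then asks which families occur in all (or most) of them. The organism names, family names and the `min_fraction` parameter are hypothetical placeholders, and this presence/absence count is only an illustration, not the method used in the studies cited above.

```python
# Hypothetical family inventories; organism and family names are placeholders,
# not real data.
inventories = {
    "organism_1": {"F1", "F2", "F3", "F4"},
    "organism_2": {"F1", "F2", "F3", "F5"},
    "organism_3": {"F1", "F2", "F5"},
}

def conserved_families(inventory_by_organism, min_fraction=1.0):
    """Return the families present in at least min_fraction of the organisms."""
    n_organisms = len(inventory_by_organism)
    counts = {}
    for families in inventory_by_organism.values():
        for family in families:
            counts[family] = counts.get(family, 0) + 1
    return sorted(f for f, c in counts.items() if c / n_organisms >= min_fraction)

# Families found in every organism (the strictly conserved core).
core = conserved_families(inventories)
print(core)

# Relaxed criterion: families found in at least 60 % of the organisms.
print(conserved_families(inventories, min_fraction=0.6))
```

Relaxing `min_fraction` mimics the observation that conserved families are found in virtually all, but not necessarily all, organisms.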

In order for everything to relate correctly to each other at the next higher level, the level

of protein networks, there is considerable biological redundancy and robustness. This is

necessary to ensure that every signal is correctly understood and does not get lost in the

noise (see Chap. 7).

Signals are further amplified in signal cascades. All this can be deciphered by network

analysis, which is a very efficient way of finding central proteins (hubs) that have a large

number of neighbours (e.g. network analysis with Cytoscape). The structure of the

network also reveals interfering signals as well as modifying and reciprocal input (cross-talk).
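How hubs are found by counting neighbours can be sketched in a few lines. The example below assumes an undirected interaction network given as a list of protein pairs; the protein names, the edge list and the `min_degree` cut-off are hypothetical. Tools such as Cytoscape report the same quantity (node degree) alongside many other network measures.

```python
from collections import defaultdict

# Hypothetical toy interaction list; protein names are placeholders, not real data.
edges = [
    ("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P1", "P5"),
    ("P2", "P3"), ("P4", "P6"), ("P6", "P7"),
]

def degree_table(edge_list):
    """Count the distinct neighbours of every node in an undirected network."""
    neighbours = defaultdict(set)
    for a, b in edge_list:
        neighbours[a].add(b)
        neighbours[b].add(a)
    return {node: len(partners) for node, partners in neighbours.items()}

def find_hubs(edge_list, min_degree):
    """Return (node, degree) pairs with degree >= min_degree, highest degree first."""
    degrees = degree_table(edge_list)
    hubs = [(node, d) for node, d in degrees.items() if d >= min_degree]
    return sorted(hubs, key=lambda item: (-item[1], item[0]))

hubs = find_hubs(edges, min_degree=3)
print(hubs)
```

The cut-off is a free choice here; in practice one would usually inspect the whole degree distribution before deciding what counts as a hub.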

A fascinating and illustrative example can be found in the KEGG (Kyoto Encyclopedia

of Genes and Genomes) pathway database: the “maps of cancer pathways”,

which illustrate important stages of cancer (promoting and inhibiting pathways) for the

user and allow the pathway inventories of different organisms to be compared.

Building on these foundations, modeling pathways in cancer development and finding

better drugs against them is certainly a fascinating topic in bioinformatics (see Chap. 13).

Again, the contextuality of all molecules helps to systematically identify the promoting

and inhibiting pathways, for example, by gene expression analyses of healthy cells and

cancer cells (in cancer cells, almost all important observed changes in gene expression interact

to further drive the cancer).
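A comparison of gene expression between healthy and cancer cells can be sketched as a log2 fold change per gene. The gene names, expression values, pseudocount and threshold below are hypothetical; a real analysis would use replicates and a statistical test rather than a fixed cut-off.

```python
import math

# Hypothetical expression values (arbitrary units); gene names are placeholders.
healthy = {"geneA": 10.0, "geneB": 50.0, "geneC": 8.0, "geneD": 20.0}
cancer = {"geneA": 80.0, "geneB": 52.0, "geneC": 1.0, "geneD": 21.0}

def log2_fold_changes(reference, condition, pseudocount=1.0):
    """log2(condition / reference) per shared gene; the pseudocount avoids division by zero."""
    return {
        gene: math.log2((condition[gene] + pseudocount) / (reference[gene] + pseudocount))
        for gene in reference
        if gene in condition
    }

def classify(fold_changes, threshold=1.0):
    """Split genes into up- and down-regulated lists using a log2 threshold."""
    up = sorted(g for g, fc in fold_changes.items() if fc >= threshold)
    down = sorted(g for g, fc in fold_changes.items() if fc <= -threshold)
    return up, down

fold_changes = log2_fold_changes(healthy, cancer)
up, down = classify(fold_changes)
print("up in cancer:", up)
print("down in cancer:", down)
```

Genes from the two lists could then be mapped onto pathway maps to see which pathways they promote or inhibit.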

Redundancy is also reflected in the fact that metabolic networks offer several alternative

synthesis pathways for important metabolites, and for many others as well. This

protects against numerous genetic mutations that could otherwise disrupt the network, and

at the same time allows an organism to cope much better with fluctuations in the metabolites

present in the environment.
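The protective effect of alternative synthesis routes can be made concrete with a small knockout test. The sketch assumes a strongly simplified network in which every reaction converts exactly one metabolite into one other; the reaction and metabolite names are hypothetical. A reaction counts as essential if removing it alone makes the target metabolite unreachable from the source.

```python
from collections import deque

# Hypothetical toy network: reaction name -> (substrate, product).
reactions = {
    "r1": ("S", "A"),
    "r2": ("A", "M"),
    "r3": ("S", "B"),
    "r4": ("B", "M"),
    "r5": ("M", "X"),
}

def reachable(reaction_map, source, target, knocked_out=()):
    """Breadth-first search: can target be made from source without the knocked-out reactions?"""
    successors = {}
    for name, (substrate, product) in reaction_map.items():
        if name not in knocked_out:
            successors.setdefault(substrate, []).append(product)
    seen, queue = {source}, deque([source])
    while queue:
        current = queue.popleft()
        if current == target:
            return True
        for nxt in successors.get(current, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

def essential_reactions(reaction_map, source, target):
    """Reactions whose single knockout disconnects target from source."""
    return sorted(
        name for name in reaction_map
        if not reachable(reaction_map, source, target, knocked_out=(name,))
    )

# M has two independent routes from S, so no single knockout is essential for it.
print(essential_reactions(reactions, "S", "M"))
# X depends on the single reaction r5.
print(essential_reactions(reactions, "S", "X"))
```

Real metabolic models also consider stoichiometry and fluxes; this reachability check only captures the topological side of redundancy.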

12.2 Printing Errors Are Constantly Selected Away in the Cell